Complete a data analytics project that demonstrates your mastery of the course content.
After several years of operation, the Lending Club™ wants to understand how the individual components (i.e., features) of a loan, such as the amount loaned, the term length, or the interest rate, affect the profitability of making a specific loan. In particular, they're interested in understanding how likely a loan is to be repaid and what type of return they can expect. In this project, you will use the provided loans dataset to help the Lending Club™ figure out which loans are most profitable.
You will work in groups of 4-5 students to analyze these data and make recommendations based on the variables in the loan.csv dataset.
You will complete three tasks for this group project:
Your final group report will be a single Jupyter notebook that integrates Markdown, Python code, and the results from your code, such as data visualizations. Markdown cells should be used to explain any decisions you make regarding the data, to discuss any plots or visualizations generated in your notebook, and to present the results of your analysis. As a general guideline, the content should be written so that a fellow classmate (or an arbitrary data scientist/analyst) could read your report and understand the results, implications, and processes that you followed to achieve your result. If printed (not that you should do this), your report should be at least fifteen pages.
Your group will present the material in-class in a format that is left up to each group. For example, you can use presentation software such as MS Powerpoint, PDFs, your Notebook, or Prezi, or, alternatively, you can choose some other presentation style (feel free to discuss your ideas with the course staff). The presentations should cover all steps in your analytics process and highlight your results. The presentation should take between eight to twelve minutes, and will be graded by your discussion teaching assistant.
Your report should:
Preprocess all data appropriately.
Build a classifier on the training dataset to classify loans as either repaid or not.
Build a regression model on the training dataset to predict the loan return.
Summarize the results of your analysis. This summary should include anything interesting you found when performing EDA. Also, discuss the results of each classification and regression model: be sure to address whether your classifier was much better than random on the test data, and comment on how accurate the predictions were from your regression model. Next, discuss the importance of each feature in both machine learning tasks. Finally, comment on your results and how they might be used to improve the performance of future loans made by the Lending Club™.
In order to ensure everyone starts at the same point, we provide the following Code cell that creates the two target columns that you will use for the analyses required to complete this group project. The return feature encodes the return for each loan and the repaid feature encodes whether the loan was repaid in full or not (for simplicity, we assume that the loan is repaid if the borrower pays more than the loan amount).
You should include these code cells in your own group project notebook.
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
import warnings
warnings.filterwarnings("ignore")
loan = pd.read_csv('loan.csv', low_memory=False)
loan['return'] = (loan['total_pymnt_inv'] - loan['funded_amnt_inv']) / loan['funded_amnt_inv']
loan['repaid'] = loan['total_pymnt_inv'] > loan['funded_amnt_inv']
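As a quick sanity check of the two targets, a pair of toy rows (hypothetical values, not taken from loan.csv) shows how return and repaid behave:

```python
import pandas as pd

# Toy rows (hypothetical values) to illustrate the two target definitions
toy = pd.DataFrame({
    'funded_amnt_inv': [1000.0, 1000.0],
    'total_pymnt_inv': [1150.0, 600.0],  # first repaid with interest, second defaulted
})
toy['return'] = (toy['total_pymnt_inv'] - toy['funded_amnt_inv']) / toy['funded_amnt_inv']
toy['repaid'] = toy['total_pymnt_inv'] > toy['funded_amnt_inv']
print(toy[['return', 'repaid']])
```

The first loan yields a 15% return and counts as repaid; the second loses 40% and does not.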
First of all, we encode the categorical features so they can be used with SelectKBest. Then we define a function to preview the missing values and the percentage of missing values in each column. Because too many missing values would disturb our results, we drop the columns with over 5% missing values ('emp_title', 'desc', 'mths_since_last_delinq', 'mths_since_last_record', 'next_pymnt_d', 'mths_since_last_major_derog', 'tot_coll_amt', 'tot_cur_bal', 'total_rev_hi_lim', 'Year', 'policy_code', 'id', 'member_id'). We also drop rows containing missing values, NaNs, and strings to clean the dataframe, and drop columns with low variance since they contain little information. As a result, the new dataframe is cleaner and contains only useful information.
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
# create a new Dataframe by copying the original one so that subsequent processing will not influence each other
newloan = loan.copy(deep=True)
# encode each categorical feature according to its own category ordering
# (pd.CategoricalDtype replaces the deprecated astype('category', categories=...) API)
newloan['grade'] = newloan.grade.astype(pd.CategoricalDtype(categories=['G', 'F', 'E', 'D', 'C', 'B', 'A'])).cat.codes
newloan['term'] = newloan.term.astype(pd.CategoricalDtype(categories=['60 months', '36 months'])).cat.codes
newloan['home_ownership'] = newloan.home_ownership.astype(pd.CategoricalDtype(categories=['NONE', 'OTHER', 'RENT', 'MORTGAGE', 'OWN'])).cat.codes
newloan['verification_status'] = newloan.verification_status.astype(pd.CategoricalDtype(categories=['Not Verified', 'Source Verified', 'Verified'])).cat.codes
newloan['addr_state'] = newloan.addr_state.astype('category').cat.codes
newloan['purpose'] = newloan.purpose.astype('category').cat.codes
newloan['initial_list_status'] = newloan['initial_list_status'].astype('category').cat.codes
newloan['repaid_label'] = newloan.repaid.astype(int)
# drop columns that are entirely NaN
newloan = newloan.dropna(axis=1, how='all')
# define a function to summarize missing values per column
def missing_values_table(df):
    mis_val = df.isnull().sum()
    mis_val_percent = 100 * df.isnull().sum() / len(df)
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
    mis_val_table_ren_columns = mis_val_table.rename(
        columns={0: 'Missing Values', 1: '% of Total Values'})
    return mis_val_table_ren_columns
missing_values_table(newloan)
# drop columns that contain over 5% missing values
newloan.drop(['emp_title','desc','mths_since_last_delinq','mths_since_last_record','next_pymnt_d','mths_since_last_major_derog',\
'tot_coll_amt','tot_cur_bal','total_rev_hi_lim','Year','policy_code', 'id', 'member_id','collections_12_mths_ex_med' ], axis=1 , inplace=True,errors= 'ignore')
# select numerical columns
cols = newloan.columns
num_cols = newloan._get_numeric_data().columns
num_loan = newloan[num_cols].copy()
# drop columns with low variance (assign the result back, otherwise the drop is discarded)
newloan = newloan.drop(newloan.std()[newloan.std() < 0.2].index.values, axis=1)
# drop rows with missing values, then keep only non-object (numeric) columns
numloan = num_loan.dropna()
featuresloan = numloan.select_dtypes(exclude='object')
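The hard-coded drop list above can also be derived programmatically from the 5% rule; a minimal sketch on a toy frame (hypothetical columns, not from loan.csv):

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical columns) with varying amounts of missing data
df = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0],
    'b': [np.nan, np.nan, np.nan, 1.0],   # 75% missing
    'c': [5.0, np.nan, 7.0, 8.0],         # 25% missing
})
missing_pct = 100 * df.isnull().mean()                 # percent missing per column
to_drop = missing_pct[missing_pct > 5].index.tolist()  # columns over the 5% threshold
df = df.drop(columns=to_drop)
print(to_drop, list(df.columns))
```

Deriving the list this way keeps the notebook robust if the dataset changes.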
First, we used the SelectKBest model to compute a relevance score for each feature, then chose the 20 features with scores above 100. In the following sections, we draw histograms, compute descriptive statistics, and show plots for these 20 features, and select 9 of them for the machine learning tasks based on the characteristics they reveal.
# SelectKBest model
features = featuresloan.iloc[:, 2:-2]
labels = featuresloan.iloc[:, -1]
skb = SelectKBest(k='all')
fs = skb.fit(features, labels)
# note: scores correspond to the columns of `features`, not all of featuresloan
for var, name in sorted(zip(fs.scores_, features.columns), key=lambda x: x[0], reverse=True):
    print(f'{name:>18} score = {var:5.3f}')
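To confirm that SelectKBest scoring behaves as expected, a small synthetic check (using the default f_classif scoring, with made-up data) where only the first feature carries signal:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.RandomState(0)
y = rng.randint(0, 2, 500)          # binary labels
X = rng.randn(500, 3)               # three noise features
X[:, 0] = X[:, 0] + 2 * y           # only feature 0 tracks the class label

skb = SelectKBest(f_classif, k='all').fit(X, y)
best = int(np.argmax(skb.scores_))  # index of the highest-scoring feature
print(best)
```

The informative feature receives by far the highest score, which is the behavior we rely on when ranking the loan features.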
The histograms above show the distribution of each selected categorical feature, making it easy to see how many categories each feature has and which are most and least common. Grade, for example, has 7 categories; B is the most common and G the least. For home ownership, rent and mortgage are roughly tied in first place, followed by own, while other and none are the least common. For features with only 2 categories, such as term and initial_list_status, we can compare the two directly: for instance, 36 months is more popular than 60 months, probably due to interest rate or credit considerations.
f, ax = plt.subplots(figsize=(10, 8))
# adjust the order of grade for better view
ax=sns.countplot('grade',data=loan, order=["A", "B","C","D","E","F","G"])
ax.set_xlabel('Grade', fontsize=15)
ax.set_ylabel('Frequency', fontsize=15)
plt.show()
# use for loop to draw most categorical features' histograms
x=['initial_list_status', 'home_ownership', 'pub_rec', 'term']
for i in range(len(x)):
    f, ax = plt.subplots(figsize=(10, 8))
    ax = sns.countplot(x[i], data=loan)
    ax.set_xlabel(x[i], fontsize=15)
    ax.set_ylabel('Frequency', fontsize=15)
    plt.show()
# draw separately at a larger size for a better view, because 'purpose' has many categories
f, ax = plt.subplots(figsize=(25, 8))
ax=sns.countplot(x="purpose",data=loan)
ax.set_xlabel('Purpose', fontsize=15)
ax.set_ylabel('Frequency', fontsize=15)
plt.show()
From all these histograms, we can see clearly that total payment, interest received to date, funded amount, funded amount by investors, interest rate, total number of credit lines, loan amount, and dti look well dispersed, while last payment amount, remaining outstanding principal, remaining outstanding principal by investors, recoveries, and late fees received to date clump around a few points. It is worth mentioning that the plots of dti (the ratio) and total number of credit lines look very much like normal distributions. As for missing values, since a histogram cannot be drawn over missing values, all the histograms presented are plotted without them; to see the missing values for each feature, refer to the missing value table presented above.
y=['last_pymnt_amnt', 'total_pymnt', 'total_rec_int', 'out_prncp', 'out_prncp_inv', 'recoveries', 'funded_amnt', 'funded_amnt_inv', 'int_rate', 'total_rec_late_fee', 'total_acc', 'loan_amnt','dti']
# use for loop to draw all the numerical features' histograms
for i in range(len(y)):
    ax = sns.distplot(featuresloan[y[i]], kde=True)
    ax.set_xlabel(y[i], fontsize=14)
    ax.set_ylabel("Frequency", fontsize=14)
    plt.show()
The .describe() function summarizes each categorical feature in words, reporting the top category and its frequency. For example, the most popular home ownership category is rent, and the most popular purpose is debt consolidation. Regarding term, the more common option is 36 months, probably because people prefer short-term loans at a lower interest rate. We can also see that the most popular initial list status is fractional (f), far more common than whole. So this is a good, straightforward way to analyze the features and data alongside the histograms.
For numerical features, we think descriptive statistics are easier to analyze and interpret. For example, the maximum last_pymnt_amnt is 36115, far above the 75th percentile (3806), reflecting a high standard deviation. As for loan_amnt, most amounts are around 1000 to 1700 with a standard deviation of 7883. Interest rate has a smaller range, from a minimum of 5.42% to a maximum of 24.89%, though that is still a wide spread for a borrower. Interestingly, out_prncp and out_prncp_inv behave similarly, with a very wide range: some borrowers repaid all the money, while others could not afford a large amount of debt and did not repay. Descriptive statistics can be used to analyze many other features in the same way.
x=['grade', 'initial_list_status', 'home_ownership', 'term', 'purpose']
# use for loop to describe all the categorical features
for i in range(len(x)):
    a = loan[x[i]].describe()
    print(a)
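For reference, .describe() on a toy categorical series (hypothetical home-ownership values) reports the top category and its frequency directly:

```python
import pandas as pd

# Hypothetical home-ownership values, not from loan.csv
s = pd.Series(['RENT', 'RENT', 'MORTGAGE', 'OWN', 'RENT'])
desc = s.describe()                 # count, unique, top, freq for object data
print(desc['top'], desc['freq'])
```

Here 'RENT' is the top category with a frequency of 3, mirroring how we read off the most popular categories above.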
y=['last_pymnt_amnt', 'total_pymnt', 'total_rec_int', 'out_prncp', 'out_prncp_inv', 'recoveries', 'funded_amnt', 'funded_amnt_inv', 'int_rate', 'total_rec_late_fee', 'total_acc', 'loan_amnt','dti']
# use for loop to describe all the numerical features
for i in range(len(y)):
    b = featuresloan[y[i]].describe()
    print(b)
fig, ax = plt.subplots(figsize=(15,10))
ax.hist(featuresloan['return'], density=True, bins=50)
x = plt.xticks()[0]
# get minimum and maximum value of x
xmin, xmax = min(x), max(x)
# evenly spaced numbers over an interval
lnspc = np.linspace(xmin, xmax, len(featuresloan['return']))
# get mean and standard deviation
m, s = stats.norm.fit(featuresloan['return'])
# now get theoretical values in our interval
pdf = stats.norm.pdf(lnspc, m, s)
# plot the histogram and fit the normal distribution
plt.plot(lnspc, pdf, label="Norm")
ax.set_xlabel("return(%)", fontsize=18)
ax.set_ylabel("Frequency", fontsize=18)
ax.legend(loc=1, fontsize = 'x-large')
plt.show()
fig, ax = plt.subplots(figsize=(18,10))
ax.hist(featuresloan['repaid'], density=True, bins=50)
x = plt.xticks()[0]
# get minimum and maximum value of x
xmin, xmax = min(x), max(x)
# evenly spaced numbers over an interval
lnspc = np.linspace(xmin, xmax, len(featuresloan['repaid']))
# get mean and standard deviation
m, s = stats.norm.fit(featuresloan['repaid'])
# now get theoretical values in our interval
pdf_g = stats.norm.pdf(lnspc, m, s)
# plot the histogram and fit the normal distribution
plt.plot(lnspc, pdf_g, label="Norm")
ax.set_xlabel("repaid", fontsize=18)
ax.set_ylabel("Frequency", fontsize=18)
ax.legend(loc=1, fontsize = 'x-large')
plt.show()
The histograms show clearly that the return feature fits the normal distribution much better than repaid. This is probably because repaid is a categorical feature with just two categories.
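The normal fit used above (stats.norm.fit) returns maximum-likelihood estimates of the mean and standard deviation; a quick check on synthetic "return"-like data (hypothetical parameters) confirms it recovers the true values:

```python
import numpy as np
from scipy import stats

rng = np.random.RandomState(42)
# Hypothetical return-like sample: mean 5%, std 20%
sample = rng.normal(loc=0.05, scale=0.2, size=10000)
m, s = stats.norm.fit(sample)   # maximum-likelihood mean and standard deviation
print(m, s)
```

With 10,000 points, the fitted parameters land very close to the true 0.05 and 0.2, which is why overlaying the fitted pdf on the histogram is a fair visual test.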
We used violin plots for the categorical features and scatter plots for the numerical features. Because 'total_pymnt_inv', 'funded_amnt_inv', and 'total_pymnt' are used to compute 'repaid' and 'return', we excluded them from consideration. We then chose the more representative plots, involving 'grade', 'home_ownership', 'pub_rec', 'term', 'last_pymnt_amnt', 'total_rec_int', 'out_prncp', 'out_prncp_inv', 'recoveries', and 'int_rate'.
From this part we can see:
For 'grade', we think it is strongly related to the target features. When the loan is repaid, the return increases as the grade worsens; when the borrower fails to repay, the return decreases as the grade worsens. In our view, this is because as the grade worsens, the individual or company must pay a higher financing cost to borrow the same amount of money, so the return is higher if the money comes back and lower if it does not. We also see an interesting point: lower-grade borrowers appear more likely to borrow, since the covered area for the lower grades is larger. Perhaps lower-grade borrowers are more often in situations where they lack funding.
For 'term', we can also see a strong relationship with the target features: the longer the term, the better the return. We think the reason is that a longer term carries more risk that the borrower will not repay, so the company charges a higher return to balance that risk. The interesting point is that when a borrower fails to repay, the mean loss for the longer term is smaller than for the shorter one. Our guess is that because interest is higher on long-term loans, the borrower has already repaid more money before defaulting, even without repaying the full amount.
For 'home_ownership', we expected people who own a home to be more likely to repay, since owning suggests they can afford a house or an apartment. However, we cannot see an apparent relationship between the target features and the home ownership categories, since the covered area (number of people) and the mean return show no clear difference across types.
# Define plot layout
# adjust the order of grade for better view
axs = sns.violinplot(x="grade", y="return", hue="repaid", data=loan, order=["A", "B","C","D","E","F","G"])
sns.despine(left=True, offset=10)
plt.show()
x=['initial_list_status', 'home_ownership', 'pub_rec', 'term']
# use for loop to draw most categorical features' violinplot
for i in range(len(x)):
    f, ax = plt.subplots(figsize=(10, 8))
    ax = sns.violinplot(x[i], y="return", hue="repaid", data=loan)
    ax.set_xlabel(x[i], fontsize=15)
    ax.set_ylabel('Return', fontsize=15)
    plt.show()
# draw separately at a larger size for a better view, because 'purpose' has many categories
f, ax = plt.subplots(figsize=(25, 10))
ax=sns.violinplot(x="purpose",y="return", hue="repaid", data=loan)
ax.set_xlabel('Purpose', fontsize=15)
ax.set_ylabel('Return', fontsize=15)
plt.show()
As these features are continuous variables, we think scatter plots are suitable to show all the data.
For 'last_pymnt_amnt', in our view it is closely related to the target features. Interestingly, when the borrower fails to repay, the last payment amount is very low; we think this is because that last payment is not principal but just some interest. So if we know the last payment amount, we can estimate whether the borrower will repay and how much return we will get: for example, a high amount nearly always means the loan was repaid.
For 'total_rec_int', we think it also matters for predicting return and repayment. The scatter plot shows that a large number of borrowers who failed to repay the principal paid relatively less interest than those who repaid. Moreover, the more total interest received, the smaller the loss (return) from the borrowers.
For 'out_prncp', there is some relationship with the target features: the larger the remaining outstanding principal, the lower the return. We can also see that the remaining outstanding principal of borrowers who failed to repay is centered around 10000, while for those who repaid it is centered at a very low amount. This makes sense because, once a loan is repaid, there is no remaining principal, except when the borrower took the loan at a high interest rate, so the total payment exceeds the funded amount even though the principal has not been fully paid back.
For 'out_prncp_inv', it shows the same characteristics as 'out_prncp'.
For 'recoveries', we think this feature is related to the target features: a larger share of borrowers with a post-charge-off recovery amount ultimately did not repay, so it can be used for prediction. An interesting pattern remains: the larger the recovery amount, the smaller the return from borrowers who repaid, but also the smaller the loss from borrowers who defaulted.
For 'int_rate', the relationship is strong. For borrowers who repaid, the return is naturally higher when the interest rate is higher, so it can be used to calculate the expected return. However, this plot does not tell us whether the money will be repaid.
# select samples from data for a better view of the scatterplot
aloan=loan.sample(8000)
y=['last_pymnt_amnt', 'total_pymnt', 'total_rec_int', 'out_prncp', 'out_prncp_inv', 'recoveries', 'funded_amnt', 'funded_amnt_inv', 'int_rate', 'total_rec_late_fee', 'total_acc', 'loan_amnt','dti']
# use for loop to draw all the numerical features' scatterplot
for i in range(len(y)):
    plt.figure(figsize=(20, 10))
    plt.scatter(aloan[y[i]][aloan['repaid'] == True], aloan['return'][aloan['repaid'] == True], label='True')
    plt.scatter(aloan[y[i]][aloan['repaid'] == False], aloan['return'][aloan['repaid'] == False], label='False')
    plt.xlabel(y[i], fontsize=18)
    plt.ylabel('return', fontsize=16)
    plt.show()
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest
import warnings
warnings.filterwarnings("ignore")
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib import cm
%matplotlib inline
# Standard imports
sns.set(style="white")
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold, cross_val_score,cross_val_predict
import random
fea_sel_reg = featuresloan[['last_pymnt_amnt', 'total_rec_int', 'out_prncp', 'recoveries', 'out_prncp_inv', 'int_rate']].to_numpy()
return_data = featuresloan[['return']].to_numpy()
repaid_data = featuresloan['repaid']
## for regression
reg_train, reg_test, return_train, return_test= train_test_split(fea_sel_reg,return_data,test_size=0.4, random_state=23)
## for classification
X_train,X_test, y_train, y_test= train_test_split(fea_sel_reg,repaid_data,test_size=0.4, random_state=23)
#Classification method 1
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression()
log_reg = log_reg.fit(X_train, y_train)
from sklearn import metrics
prediction_test = log_reg.predict(X_test)
print(100*metrics.accuracy_score(y_test, prediction_test))
#Confusion Matrix
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
labels = ['Non-Repaid', 'Repaid']
# Create and display confusion matrix
print(confusion_matrix(y_test, prediction_test))
# Probability
print(classification_report(y_test, prediction_test, \
target_names = labels))
# Confusion matrix plot
import itertools
class_names = ['Non-Repaid', 'Repaid']
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')
    print(cm)
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
# Compute confusion matrix
conf_log = confusion_matrix(y_test, prediction_test)
np.set_printoptions(precision=2)
#cnf_matrix = confusion_matrix(y_test, y_pred)
# Plot normalized confusion matrix
plt.figure()
plot_confusion_matrix(conf_log, classes=class_names, normalize=True,
title='Normalized confusion matrix')
# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(conf_log, classes=class_names,
title='Confusion matrix, without normalization')
plt.show()
# Classification method 2
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier()
# Fit estimator to scaled training data
rfc = rfc.fit(X_train, y_train)
predict_test = rfc.predict(X_test)
print(100*metrics.accuracy_score(y_test, predict_test))
#Confusion Matrix
from sklearn.metrics import confusion_matrix
# Create and display confusion matrix
print(confusion_matrix(y_test, predict_test))
#Probability
print(classification_report(y_test, predict_test, \
target_names = labels))
# reuse the plot_confusion_matrix function defined above
# Compute confusion matrix
conf_rfc = confusion_matrix(y_test, predict_test)
np.set_printoptions(precision=2)
# Plot normalized confusion matrix
plt.figure()
plot_confusion_matrix(conf_rfc, classes=class_names, normalize=True,
title='Normalized confusion matrix')
# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(conf_rfc, classes=class_names,
title='Confusion matrix, without normalization')
plt.show()
from sklearn.metrics import roc_curve, auc
y_score_log = log_reg.predict_proba(X_test)[:, 1]
# Roc curve and ROC area
xclass, yclass, _ = roc_curve(y_test, y_score_log)
roc_auc_log = auc(xclass, yclass)
y_score_rfc = rfc.predict_proba(X_test)[:, 1] #try this if decision_function does not suffice
# ROC curve and ROC area
xclass2, yclass2, _ = roc_curve(y_test, y_score_rfc)
roc_auc_rfc = auc(xclass2, yclass2)
# Plot data and model
fig, ax = plt.subplots(figsize=(8, 8))
ax.plot(xclass, yclass, alpha = 0.75, linestyle='-',
label=f'Log (AUC = {roc_auc_log:4.2f})')
ax.plot(xclass2, yclass2, alpha = 0.5, linestyle='-.',
label=f'Rfc (AUC = {roc_auc_rfc:4.2f})')
ax.plot([0, 0, 1], [0, 1, 1], alpha = 0.5,
lw=1, linestyle='-.', label='Perfect')
# Decorate plot appropriately
ax.set_title('Receiver Operating Characteristic Curve', fontsize=18)
ax.set_xlabel('False Positive Rate', fontsize=16)
ax.set_ylabel('True Positive Rate', fontsize=16)
ax.set_xlim(-0.05, 1.05)
ax.set_ylim(-0.05, 1.05)
ax.set_aspect('equal')
ax.legend(loc=4, fontsize=16)
sns.despine(offset=5, trim=True)
plt.show()
# Function computes the gain from a given scikit-learn estimator and the test data features and labels
def compute_gain(mdl, d_test, l_test):
if hasattr(mdl, 'decision_function'):
# label for computing the gain.
prbas = mdl.decision_function(d_test)
pos_score = prbas
else:
# labels we need to compute the gain.
prbas = mdl.predict_proba(d_test)
pos_score = prbas[:,1]
# Generate class membership
clm = pd.get_dummies(l_test).as_matrix()
# Second column indicates 'positive' label
pos_lbl = clm[:,1]
# Compute total number of 'positive' labels
n_pos_lbl = np.sum(pos_lbl)
# Generate sorted (ascending) index array by 'positive' score
idx = np.argsort(pos_score)
# Now sort 'positive' labels by the sorted index
sort_pos = pos_lbl[idx[::-1]]
# Compute cumulative sum
cum_sum_pos = np.cumsum(sort_pos)
# lift is the ratio of cumulative improvement to all data
lift = cum_sum_pos/n_pos_lbl
# Number of test instances
num_test = l_test.shape[0]
# Amount of test data included as a percentage of total
tst_pcnt = np.arange(1,num_test + 1, 1) /num_test
return tst_pcnt, lift
#This is for the Gain chart
# Compute gains for different estimators
x_log, gain_log = compute_gain(log_reg, X_test, y_test)
x_rfc, gain_rfc = compute_gain(rfc, X_test, y_test)
# Plot
fig, ax = plt.subplots(figsize=(10, 10))
ax.plot(x_log, x_log, alpha = 0.5, linestyle='-',
label=f'Baseline')
ax.plot(x_log, gain_log, alpha = 0.75, linestyle='-.',
label=f'Log Gain')
ax.plot(x_rfc, gain_rfc, alpha = 0.75, linestyle=':',
label=f'Rfc Gain')
# Decorate plot appropriately
ax.set_title('Gain Chart', fontsize=18)
ax.set_xlabel('Test Data', fontsize=16)
ax.set_ylabel('Gain', fontsize=16)
ax.set_xlim(-0.05, 1.05)
ax.set_ylim(-0.05, 1.05)
ax.set_aspect('equal')
ax.legend(loc=4, fontsize=16)
sns.despine(offset=5, trim=True)
plt.show()
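The core of the gain computation can be sketched with plain NumPy on toy scores and labels (hypothetical values): sort by descending score, then accumulate the fraction of positives captured:

```python
import numpy as np

# Toy scores and labels (hypothetical): higher score = more likely repaid
scores = np.array([0.9, 0.8, 0.4, 0.3, 0.1])
labels = np.array([1, 1, 0, 1, 0])

order = np.argsort(scores)[::-1]                 # descending by score
lift = np.cumsum(labels[order]) / labels.sum()   # fraction of positives captured so far
pct = np.arange(1, len(labels) + 1) / len(labels)  # fraction of data examined
print(lift.tolist())
```

After examining the two highest-scored instances (40% of the data), two of the three positives are already captured, which is exactly the advantage over the diagonal baseline that the gain chart visualizes.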
Beyond the raw performance metrics, the model we prefer is the Random Forest Classifier. Logistic Regression is one of the most common methods because of its similarity to linear regression, its efficiency, and its tendency to avoid overfitting, but the Random Forest Classifier is the better model overall. Random Forest also avoids overfitting, and it uses a random subset of features to build individual decision trees that are less sensitive to minor fluctuations. This attribute brings the biggest improvement to the Random Forest Classifier's predictions.
After reviewing the results, the probability of full repayment was high, with the Random Forest Classifier model reaching 86% accuracy and the Logistic Regression model 88.8%. Coincidentally, both models predicted the same repayment breakdown: 35 total repayments, 10 unrepaid loans, and an overall 77.78% loan repayment rate. Nine features were taken into account, 6 continuous and 3 discrete: 'last_pymnt_amnt', 'total_rec_int', 'out_prncp', 'out_prncp_inv', 'recoveries', 'int_rate', 'grade', 'home_ownership', 'term'. The numerical features were selected based on the highest scores from the SelectKBest feature selection method, and the categorical features were chosen based on the strong relationships with loan return and repayment seen in the Exploratory Data Analysis. Lastly, the classifier was much better than random on the test data: its metrics show that the models are more accurate and provide a stronger sense of security than random guessing would.
Three visuals were used: the confusion matrix, the ROC curve with AUC, and the gain chart. The confusion matrix, shown both raw and normalized, displays how predictions split between repaid and non-repaid loans. The ROC curve compares the effectiveness of the classification models against one another, and the area under the curve (AUC) measures the quality of each model: a perfect model has an AUC of 1, while most models fall between 0.5 and 1. The gain chart shows the results obtained with the model versus those obtained without it.
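As a tiny worked example of the ROC/AUC computation (toy labels and scores, not our loan data):

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Four toy points: two negatives, two positives
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

fpr, tpr, _ = roc_curve(y_true, scores)  # ROC points over all thresholds
area = auc(fpr, tpr)                     # area under the ROC curve
print(area)
```

One positive is out-ranked by one negative here, so the AUC comes out at 0.75, between random (0.5) and perfect (1.0).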
Based on these features, we are confident the Lending Club can use the classification models to plan wisely and make better business decisions. With these models, the Lending Club can see how likely a person is to repay their loans.
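A self-contained sketch of the better-than-random comparison between the two classifiers, using synthetic data from make_classification as a stand-in for the loan features (assumed setup, not our actual data):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the six loan features (hypothetical data)
X, y = make_classification(n_samples=1000, n_features=6, n_informative=4,
                           random_state=23)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.4, random_state=23)

accs = {}
for model in (LogisticRegression(max_iter=1000),
              RandomForestClassifier(random_state=23)):
    accs[type(model).__name__] = accuracy_score(
        y_te, model.fit(X_tr, y_tr).predict(X_te))
print(accs)  # both should clearly beat the 0.5 random baseline
```

The same fit/predict/score pattern is what we applied to the loan features above; on informative data both models land well above chance.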
We used the following regression models (with their corresponding R-squared scores) to predict the loan return based on six numerical features: last_pymnt_amnt, total_rec_int, out_prncp, recoveries, out_prncp_inv, and int_rate.
Predictive power consists of two primary components: goodness of fit and predictive accuracy. Goodness of fit is training error, indicating how well a model predicts the data points that were already used to estimate its parameters. Predictive accuracy, however, is testing error, measuring how well a model can predict new data points whose true values it has not seen. Since R2 can be used to quantify both goodness of fit and predictive accuracy, we believe that, in general, the higher the R2 score, the more precise the generated predictions. With this in mind, we prefer the following models:
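The distinction between goodness of fit and predictive accuracy can be made concrete by computing R2 on the training and test splits separately; a sketch on synthetic, nearly linear data (hypothetical coefficients):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(23)
X = rng.randn(500, 3)
# Nearly linear target with small noise (hypothetical coefficients)
y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.randn(500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.4, random_state=23)
model = LinearRegression().fit(X_tr, y_tr)
r2_train = r2_score(y_tr, model.predict(X_tr))  # goodness of fit
r2_test = r2_score(y_te, model.predict(X_te))   # predictive accuracy
print(r2_train, r2_test)
```

When the model family matches the data, the two scores agree; a large gap between them would signal overfitting.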
But R-squared has several key limitations. R-squared cannot identify whether the predictions and coefficient estimates are biased, and it does not by itself indicate whether a regression model is adequate. Therefore, we further analyze the residual plots. Recall that a residual is the unpredictable random part of each data point, so we expect the residuals to be randomly scattered in the plot without showing any systematic pattern. That is, ideally, they are symmetrically scattered around zero, clustered near zero, and free of obvious patterns or outliers.
We applied these three criteria to evaluate the residual plots of the regression models above. The residual plots for the Linear Regression model do not meet these requirements: they contain outliers and are not evenly distributed vertically. Although AdaBoost Regression has a low R-squared score, its residuals, like those of Random Forest Regression, Extremely Randomized Trees Regression, and the Pipeline regression, are scattered relatively randomly in the plots, which indicates a good fit.
Based on the analysis above, we prefer the Extremely Randomized Trees and AdaBoost Regression models to predict the return.
The Extremely Randomized Trees Regression model generated the following predicted return statistics:
The AdaBoost Regression model generated a smaller range of predicted returns with a lower standard deviation:
Based on the statistics above, we believe the return on a loan falls in the range of -100% to 62.4%, and the average return on a loan falls in the range of -2.8% to 8.4%.
def make_res_plot(ind_test, dep_test, features, results, m):
    # Scatter the residuals (actual - predicted) against one feature
    fig, ax = plt.subplots()
    ax.scatter(ind_test, dep_test - results, label='Testing Data')
    ax.hlines(0, 0, m, color='r', alpha=0.25)  # reference line at zero residual
    ax.set_xlabel(features, fontsize=14)
    ax.set_ylabel("Residual", fontsize=14)
    ax.set_title("Regression Plot (model residuals)", fontsize=14)
    return ax
def make_Reg_plt(reg_test, return_test, pred):
    # Sample one common set of row indices so that each feature value,
    # actual return, and prediction remain paired together; sampling each
    # array independently would scramble the residuals
    n = 4000
    idx = np.random.choice(return_test.reshape(-1).shape[0], n)
    return_test_rand = return_test.reshape(-1)[idx]
    pred_rand = pred[idx]
    make_res_plot(reg_test[idx, 0], return_test_rand, 'last_pymnt_amnt', pred_rand, 35000)
    make_res_plot(reg_test[idx, 1], return_test_rand, 'total_rec_int', pred_rand, 25000)
    make_res_plot(reg_test[idx, 2], return_test_rand, 'out_prncp', pred_rand, 20000)
    make_res_plot(reg_test[idx, 3], return_test_rand, 'recoveries', pred_rand, 20000)
    make_res_plot(reg_test[idx, 4], return_test_rand, 'out_prncp_inv', pred_rand, 20000)
    make_res_plot(reg_test[idx, 5], return_test_rand, 'int_rate', pred_rand, 50)
    return
def reg_statistics(result):
    # Summarize the distribution of predicted returns
    print('Predicted maximum return = {:.1%}'.format(np.max(result)))
    print('Predicted minimum return = {:.1%}'.format(np.min(result)))
    print('Predicted average return = {:.1%}'.format(np.mean(result)))
    print('Predicted std of return = {:.1%}'.format(np.std(result)))
    return
## Linear Regression
model = LinearRegression(fit_intercept=True)
model.fit(reg_train,return_train)
result = model.predict(reg_test)
score = 100.0 * model.score(reg_test, return_test)
print(f'Multivariate LR Model score = {score:5.2f}%')
make_Reg_plt(reg_test,return_test,result.reshape(-1))
reg_statistics(result)
## Random Forest: Regression
regressor = RandomForestRegressor(random_state=23)
auto_model = regressor.fit(reg_train, return_train)
print('Score = {:.1%}'.format(auto_model.score(reg_test, return_test)))
pred = auto_model.predict(reg_test)
mr2 = r2_score(return_test, pred)
print(f'R^2 Score = {mr2:5.3f}')
## Random Forest Regression with Cross Validation
kf = KFold(n_splits=6, shuffle=True, random_state=23)  # shuffle is required when random_state is set
scores = cross_val_score(regressor, reg_train, return_train, cv=kf)
mean_score=np.mean(scores)
print('CV Score = {:.1%}'.format(mean_score))
make_Reg_plt(reg_test,return_test,pred)
reg_statistics(pred)
## Extremely Randomized Trees: Regression
auto_model = ExtraTreesRegressor(random_state=23)
auto_model = auto_model.fit(reg_train, return_train)
print('Score = {:.1%}'.format(auto_model.score(reg_test, return_test)))
pred = auto_model.predict(reg_test)
mr2 = r2_score(return_test, pred)
print(f'R^2 Score = {mr2:5.3f}')
make_Reg_plt(reg_test,return_test,pred)
reg_statistics(pred)
## Decision Tree Regression
auto_model = DecisionTreeRegressor(random_state=23)
auto_model = auto_model.fit(reg_train, return_train)
print('Score = {:.1%}'.format(auto_model.score(reg_test, return_test)))
pred = auto_model.predict(reg_test)
mr2 = r2_score(return_test, pred)
print(f'R^2 Score = {mr2:5.3f}')
## Gradient Tree Boosting: Regression
auto_model = GradientBoostingRegressor(random_state=23)
auto_model = auto_model.fit(reg_train, return_train)
print('Score = {:.1%}'.format(auto_model.score(reg_test, return_test)))
pred = auto_model.predict(reg_test)
mr2 = r2_score(return_test, pred)
print(f'R^2 Score = {mr2:5.3f}')
## AdaBoost Regression
auto_model = AdaBoostRegressor(random_state=23)
auto_model = auto_model.fit(reg_train, return_train)
print('Score = {:.1%}'.format(auto_model.score(reg_test, return_test)))
pred = auto_model.predict(reg_test)
mr2 = r2_score(return_test, pred)
print(f'R^2 Score = {mr2:5.3f}')
make_Reg_plt(reg_test,return_test,pred)
reg_statistics(pred)
## Pipeline Regression
auto_model_p = RandomForestRegressor(random_state=23)
am_reg = Pipeline([('RFR', auto_model_p)])
am_reg.set_params(RFR__random_state=23)
am_reg.fit(reg_train, return_train)
print('Score = {:.1%}'.format(am_reg.score(reg_test, return_test)))
pred = auto_model_p.predict(reg_test)
mr2 = r2_score(return_test, pred)
print(f'R^2 Score = {mr2:5.3f}')
make_Reg_plt(reg_test,return_test,pred)
reg_statistics(pred)
## Lasso Regression
def Lasso_r(X_train, y_train, X_test, y_test, alpha, random_state):
    # Fit a Lasso model and return its R^2 score on the test set
    lasso = Lasso(alpha=alpha, random_state=random_state)
    reg = lasso.fit(X_train, y_train)
    r2 = reg.score(X_test, y_test)
    return r2
print('Score = {:.1%}'.format(Lasso_r(reg_train, return_train, reg_test, return_test, 1, 23)))
## Lasso Graph
alphas = np.arange(1, 250, 1)
scores = []
for alpha in alphas:
    scores.append(Lasso_r(reg_train, return_train, reg_test, return_test, alpha, 23))
plt.plot(alphas, scores, label="r2 score vs alpha")
plt.title("Lasso Regression", fontsize=18)
plt.xlabel("alpha", fontsize=14)
plt.ylabel("R^2 score", fontsize=14)
plt.legend(loc='best')